Introduction


Data Preparation

  1. Checkins (for the file dataset_TIST2015_Checkins.txt).
  2. POI (for the file dataset_TIST2015_POIs.txt).
  3. City (for the file dataset_TIST2015_Cities.txt).

    The Checkins table contains the information about the check-in actions, including User_ID, Venue_ID and Chk_time. Chk_time is a calculated column that combines the UTC_time and Timezone_offset columns from the original data-set to compute the local time of each check-in when analyzing the data for the specified country.

  SELECT T.User_ID, T.Venue_ID,
         -- shift the parsed UTC timestamp by the per-check-in offset (in minutes) to get local time
         DATEADD(MINUTE, CAST(Timezone_offset AS int),
                 CAST(Yr + '-' + mon + '-' + dy + ' ' + tim AS datetime)) AS Chk_time
  FROM (SELECT chk.*,
               -- UTC_time has the fixed layout 'Tue Apr 03 18:00:09 +0000 2012'
               SUBSTRING(UTC_time, 27, 4) AS Yr,   -- year
               CASE SUBSTRING(UTC_time, 5, 3)      -- month abbreviation -> month number
                 WHEN 'Jan' THEN '01' WHEN 'Feb' THEN '02' WHEN 'Mar' THEN '03'
                 WHEN 'Apr' THEN '04' WHEN 'May' THEN '05' WHEN 'Jun' THEN '06'
                 WHEN 'Jul' THEN '07' WHEN 'Aug' THEN '08' WHEN 'Sep' THEN '09'
                 WHEN 'Oct' THEN '10' WHEN 'Nov' THEN '11' WHEN 'Dec' THEN '12'
               END AS mon,
               SUBSTRING(UTC_time, 9, 2)  AS dy,   -- day of month
               SUBSTRING(UTC_time, 12, 8) AS tim   -- hh:mm:ss
        FROM dbo.Checkins chk
       ) T

Then the result should replace the dbo.Checkins table.

chk_poi_country %>% head() %>% select(User_ID, Venue_ID, Venue_Category_Name)
## # A tibble: 6 x 3
##   User_ID Venue_ID                 Venue_Category_Name
##     <int> <chr>                    <chr>              
## 1  141635 4e1b32fae4cdb140739250a3 Bar                
## 2    2485 4f110e51e4b01eefedc937c9 Skate Park         
## 3     295 4f0fd5a8e4b03856eeb6c8cb Cosmetics Shop     
## 4   33134 4f51be1de4b09f0a24711246 Zoo                
## 5   52935 4b7b884ff964a5207d662fe3 Noodle House       
## 6   11899 4c4324253735be9a18f71aa4 Music Venue


Since the check-ins in the data-set sit on different levels of the venue hierarchy, and the hierarchy itself was not provided with the data-set, a csv data-set of the hierarchies was downloaded from Zoltrix’s GitHub in order to unify the venue levels for the analysis.
The csv file “venue_subcategories.csv” contains three columns: “venue”, “venue_category” and “level”. Each venue is stored with each of its parent levels in a separate row.


Since the dataset was collected during 2012 and 2013, while the venue category hierarchy file was collected recently, there were some discrepancies between the old and the new hierarchy. To handle these discrepancies a new file, “venue_subcategories_complementary.csv”, was created manually, correcting the old values to match the new hierarchy.

Also, the highest-level categories were added to the list inside the code, as they only appear in the file as parents, never as categories in their own right.

# handle main categories (present only as parent category)
main_venue <- venue_sub_cat %>% filter(level == 0) %>% select(venue_category) %>% unique()
main_venue <- add_column(main_venue, venue = main_venue$venue_category, .before = "venue_category")
main_venue$level <- 0

venue_sub_cat <- rbind(venue_sub_cat,main_venue)


Then two columns are added to the data: “venue_category_lvl_one”, holding the level-one parent of the checked-in venue, and “venue_category”, holding the level-zero (main) parent.
A sample of the data after including the hierarchy columns:
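One plausible way to attach these columns is a pair of joins against the hierarchy table. A minimal self-contained sketch (toy data; the actual join used in our code may differ):

```r
library(dplyr)

# Toy hierarchy: each venue appears once per parent level
# (the "Noodle House" values match the sample data).
hierarchy <- data.frame(
  venue = c("Noodle House", "Noodle House"),
  venue_category = c("Restaurant", "Food"),
  level = c(1, 0))

checkins <- data.frame(Venue_Category_Name = "Noodle House")

# Look up the level-one parent, then the level-zero (main) parent.
checkins <- checkins %>%
  left_join(hierarchy %>% filter(level == 1) %>%
              select(venue, venue_category_lvl_one = venue_category),
            by = c("Venue_Category_Name" = "venue")) %>%
  left_join(hierarchy %>% filter(level == 0) %>%
              select(venue, venue_category),
            by = c("Venue_Category_Name" = "venue"))
# checkins now carries venue_category_lvl_one = "Restaurant"
# and venue_category = "Food"
```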

chk_poi_country %>% head() %>% select(User_ID, Venue_ID, Venue_Category_Name, venue_category_lvl_one, venue_category)
## # A tibble: 6 x 5
##   User_ID Venue_ID     Venue_Category_N~ venue_category_l~ venue_category 
##     <int> <chr>        <chr>             <fct>             <fct>          
## 1  141635 4e1b32fae4c~ Bar               Bar               Nightlife Spot 
## 2    2485 4f110e51e4b~ Skate Park        Athletics & Spor~ Outdoors & Rec~
## 3     295 4f0fd5a8e4b~ Cosmetics Shop    Cosmetics Shop    Shop & Service 
## 4   33134 4f51be1de4b~ Zoo               Zoo               Arts & Enterta~
## 5   52935 4b7b884ff96~ Noodle House      Restaurant        Food           
## 6   11899 4c432425373~ Music Venue       Music Venue       Arts & Enterta~
Rev_Geo$city <- find_pref(POIs$Longitude, POIs$Latitude)

Then we save the modified table in the database using dbWriteTable. Since find_pref is a lookup in a huge geojson file, it takes a long time to produce the results.
For the US we repeated the same steps, but we got a considerable number of null values for some states because the geojson files were incomplete, so we filled in the null values manually by looking the locations up on Latlong.net.
The reason we are not using the Google API is its restriction on the number of requests.
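At its core, find_pref (a custom helper whose implementation is not shown here) has to test each POI coordinate against the prefecture polygons of the geojson. A minimal sketch of such a point-in-polygon test, using sp::point.in.polygon on a toy square boundary, purely for illustration:

```r
library(sp)

# Toy stand-in for one prefecture boundary from the geojson
# (a square; real administrative boundaries are complex polygons).
pol_lon <- c(132, 133, 133, 132)
pol_lat <- c(34, 34, 35, 35)

# point.in.polygon() returns 0 = outside, 1 = strictly inside,
# 2 = on an edge, 3 = on a vertex.
inside <- point.in.polygon(132.5, 34.4, pol_lon, pol_lat)
```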
***

Simple Analysis

In the simple analysis section, we get a high-level preview of the check-ins for the selected country: drawing a histogram to explore the trend of check-ins and detect any missing or unexpected patterns, finding the most checked-in venues in the country, and checking whether any key factor, for example a group of very active users, is driving the data.

checkins_hist <- ggplot(chk_poi_country, aes(x = date(Chk_time))) +
  geom_histogram(bins = 18, col="red", aes(fill=..count..)) +
  scale_fill_gradient("Count", low = "red", high = "green") +
  ggtitle(paste("Checkins Histogram for", country_name)) +
  scale_x_date(labels = date_format("%Y-%b"),date_breaks = "3 months") +
  scale_y_continuous(labels = scales::comma) + 
  ylab("Count Checkins") + 
  xlab("Year and Month") +
  theme_bw()

As the plot above shows, the months do not all have the same amount of check-ins; they vary greatly, which suggests that the provided data-set does not cover all of Foursquare's data, or skips some of it.

# a cumulative density graph based on number of checkins by user
User_chk_count_data <- chk_poi_country %>%
  group_by(User_ID) %>%
  summarize(count = n()) %>%
  arrange(count)

User_chk_count_data$cumsum <- cumsum(User_chk_count_data$count)
User_chk_count_data$prob <- cumsum(User_chk_count_data$count) / sum(User_chk_count_data$count)
User_chk_count_cdf <- ggplot(data = User_chk_count_data, aes(x= seq_along(User_ID),y = prob)) +
  geom_point() +
  ggtitle(paste("Distribution of checkins by users in", country_name)) +
  xlab('Number of users') +
  ylab('Distribution') +
  geom_hline(yintercept = 0.5,linetype="dotted", size = 2, color = "red")

For the check-ins distribution, it is clear that a small share of users, around 15-20%, makes 50% of the check-ins of the country, while the remaining users contribute the other 50%.
There are 9 main categories that contain all venues in Foursquare, as follows:

# counts of top categories
Top_cat_data <- chk_poi_country %>%
  count(Country_Code, venue_category,sort = TRUE)

# Bar Chart for top categories ordered by count
Top_cat_bar <- ggplot(Top_cat_data, aes(x = reorder(venue_category, -n), y=n)) +
  geom_bar(stat = 'identity', fill = "Blue") +
  xlab('Venue_Category') +
  ylab('Check-ins count')+
  ggtitle(paste("Top Categories in", country_name)) +
  theme(axis.text.x = element_text(angle = 90)) +
  scale_y_continuous(labels = scales::comma)

# counts of top 10 sub-categories
Top10_subcat_data <- chk_poi_country %>%
  filter(!is.na(venue_category_lvl_one)) %>%
  count(Country_Code, venue_category_lvl_one,venue_category,sort = TRUE) %>%
  head(n = 10)

# Bar Chart for top sub-categories ordered by count
Top10_subcat_bar <- ggplot(Top10_subcat_data, aes(x = reorder(venue_category_lvl_one, -n), y=n)) +
  geom_bar(stat = 'identity', fill = "Blue")+
  xlab('Venue_SubCategory') +
  ylab('checkin_count')+
  ggtitle(paste("Top-10 Sub-categories in", country_name)) +
  theme(axis.text.x = element_text(angle = 90)) +
  scale_y_continuous(labels = scales::comma)


Temporal Analysis: Temporal analysis enables examining and modeling the behavior of the check-ins over time. It can be used to determine whether and how concentrations change over time. The behavior of check-ins over time can be modeled as a function of previous data points of the same series. This can help identify which activities in the selected country are common on different time occasions: for example, which are common during weekends vs. weekdays, and which are popular day activities vs. night activities.

In the following section we will explore the top categories of the country on different temporal levels:

* #### Check-ins distribution over hours: The distribution of check-ins along the 24 hours of the day gives a quick view of which activities are usually done at which times of the day.

# Density chart for distribution of top sub-categories over hours of the day
Top_subcat_distr_hours_data <- chk_poi_country %>%
  filter(chk_poi_country$venue_category_lvl_one %in%
           Top_subcats$venue_category_lvl_one) %>%
  mutate(time = hour(Chk_time) + minute(Chk_time)/60 + second(Chk_time)/3600)

Top_subcat_distr_hours_density <- ggplot(Top_subcat_distr_hours_data, aes(x= time, y=..density.. ,fill= venue_category_lvl_one)) +
  geom_density(position = "stack", alpha = 0.6) +
  scale_x_continuous(name="Hours of the day", limits=c(0, 24), breaks = c(0:24))

* #### Check-ins distribution over days: The distribution of check-ins over the days of the week, on the other hand, gives another point of view on which activities are usually done during weekdays and which on weekends. We preferred to plot all 7 days together rather than split them into separate weekday and weekend charts, because weekdays and weekends vary by country: in European countries weekends are Saturday and Sunday, while in Middle Eastern countries weekends are Friday and Saturday.

# Multi-bar chart for the distribution of top categories over the week days
Top_subcat_distr_week_data <- chk_poi_country %>%
  filter(chk_poi_country$venue_category_lvl_one %in% Top_subcats$venue_category_lvl_one) %>%
  count(venue_category_lvl_one,checkin_day= wday(Chk_time, label = TRUE))

Top_subcat_distr_week_multibar <- ggplot(Top_subcat_distr_week_data, aes(x= checkin_day, y= n, fill= venue_category_lvl_one)) +
  geom_bar(stat = 'identity', position = "dodge") +
  ylab("Count Checkins") +
  xlab("Checkin Day")

* #### Check-ins distribution over months: The distribution of check-ins over the months gives an even wider scope: one can determine the activities and places people like to visit in different seasons of the year. For the same reason as with the distribution over days, we chose not to group months into seasons, because seasons in the Southern Hemisphere are the opposite of those in the Northern Hemisphere.

# for distribution of top sub-categories over months of the year
Top_subcat_distr_month_data <- chk_poi_country %>%
  filter(chk_poi_country$venue_category_lvl_one %in%
  Top_subcats$venue_category_lvl_one) %>%
  count(venue_category_lvl_one, DT= date(paste(year(Chk_time), "-", month(Chk_time), "-01", sep="")))

Top_subcat_distr_month_density <- ggplot(Top_subcat_distr_month_data, aes(x= DT, y= n, fill= venue_category_lvl_one)) +
  geom_area(stat = "identity", alpha = 0.6) +
  scale_x_date(labels = date_format("%Y-%b"), date_breaks ="3 months") +
  #scale_y_continuous(labels = scales::comma) +
  xlab("Year and Month") +
  ylab("Count Checkins") +
  guides(fill = guide_legend(title = "venue sub-category")) +
  theme_bw()

***

Frequent Pattern Analysis

Frequent Pattern Mining is a data mining topic that aims at finding interesting relationships among the items in a database; in our case the items are the venues or venue categories that the users visit. Frequent Pattern Mining defines frequent itemsets to find those relationships. Frequent itemsets can be defined as the sets of venues that appear together in many transactions; these relationships are then presented as a collection of if-then rules, called association rules.
Our motivation for using frequent patterns is to build a solid base of recommendations for users and businesses from the existing patterns for the country.

Choosing the venue category level to work on: For the frequent pattern mining, we will be using the level-one category column “venue_category_lvl_one”, because the “Venue Category Name” column of the data-set contains venues from different levels, which would generate misleading association rules such as “Movie Theater => Arts & Entertainment” and “Airport Gate => Airport”, joining an object with its parents in the hierarchy.
We also ruled out the main category column “venue_category”, because it contains only 9 categories, which would result in very trivial association rules and lose the interesting rules between detailed venues. An example of a rule that would be lost by using “venue_category” is “Salsa Club => Basketball Stadium”, because both venues fall under the category “Arts & Entertainment”.

Transactions Preparation: ‘arules’ is a library that provides an infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules). It provides implementations of the association mining algorithms Apriori and Eclat.
In order to be processed by ‘arules’, the data needs to be stored as instances of the class transactions. This is done as follows:

  1. From our check-ins data we select, for each user, the places he has been to on the same day, and place them all in one row separated by commas; this format of storing transactions is called “basket”.
  2. We store the data in a csv file.
  3. We read the csv file as transactions.
  4. We pass the transactions to the algorithm.

# Combine all places visited by user in each day in a cell (userid,date,categories)
Freq_pattern_data <- chk_poi_country %>%
  filter(!is.na(nxt_venue_cat_lvl_one)) %>%
  group_by(User_ID,Chk_time = date(Chk_time)) %>%
  arrange(Chk_time) %>%
  summarise(visited  = paste(venue_category_lvl_one, collapse =","))

Freq_pattern_data$User_ID <- NULL
Freq_pattern_data$Chk_time <- NULL
Freq_pattern_data <- Freq_pattern_data %>% ungroup()

write.csv(Freq_pattern_data,"Trn.csv", quote = FALSE, row.names = FALSE)

Freq_pattern_trn <- read.transactions(file = "Trn.csv", format = 'basket', sep=',', rm.duplicates = FALSE, skip = 1)

Then we see a summary of the transactions using summary():

## transactions as itemMatrix in sparse format with
##  938643 rows (elements/itemsets/transactions) and
##  220 columns (items) and a density of 0.008319781 
## 
## most frequent items:
##     Train Station        Restaurant     Metro Station Convenience Store 
##            356109            228498             90787             61909 
##     Shopping Mall           (Other) 
##             56822            923922 
## 
## element (itemset/transaction) length distribution:
## sizes
##      0      1      2      3      4      5      6      7      8      9 
##   1511 541975 212421  92683  42922  21150  10792   6142   3491   2034 
##     10     11     12     13     14     15     16     17     18     19 
##   1260    795    500    320    252    134     98     56     29     21 
##     20     21     22     23     24     25     28     29     33 
##     15     12      4     10      6      7      1      1      1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.00    1.00    1.83    2.00   33.00 
## 
## includes extended item information - examples:
##           labels
## 1        Airport
## 2 Animal Shelter
## 3   Antique Shop

Algorithms comparison: In order to decide between the two algorithms implemented in ‘arules’, we ran the following timing test to decide which would be better to use for our analysis.

start.time <- Sys.time()
association.rules <- eclat(Freq_pattern_trn, parameter = list(supp=0.0001, minlen=3))
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target   ext
##     FALSE   1e-04      3     10 frequent itemsets FALSE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 93 
## 
## create itemset ... 
## set transactions ...[220 item(s), 938643 transaction(s)] done [0.16s].
## sorting and recoding items ... [195 item(s)] done [0.02s].
## creating sparse bit matrix ... [195 row(s), 938643 column(s)] done [0.12s].
## writing  ... [5640 set(s)] done [0.47s].
## Creating S4 object  ... done [0.00s].
## Time difference of 0.8070979 secs
start.time <- Sys.time()
association.rules <- apriori(Freq_pattern_trn, parameter = list(supp=0.0001, conf=0.05, minlen=3))
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.05    0.1    1 none FALSE            TRUE       5   1e-04      3
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 93 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[220 item(s), 938643 transaction(s)] done [0.16s].
## sorting and recoding items ... [195 item(s)] done [0.02s].
## creating transaction tree ... done [0.49s].
## checking subsets of size 1 2 3 4 5 6 done [0.06s].
## writing ... [16541 rule(s)] done [0.00s].
## creating S4 object  ... done [0.07s].
## Time difference of 0.835901 secs

Looking at the test, Eclat seems to perform slightly better (a difference of about 0.03 sec), but since the difference between the two algorithms is so small in this case, we decided to proceed with Apriori, because it provides more detailed information when running than Eclat, which may finish faster but needs further processing to reach the same level of information. Apriori provides in its results the left-hand side, right-hand side, support, confidence and lift for each rule, while Eclat only provides each frequent itemset and its support value.
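To illustrate the “further processing” Eclat would need: arules can induce association rules from frequent itemsets with ruleInduction(). A small self-contained sketch with toy transactions (the real analysis would pass Freq_pattern_trn instead):

```r
library(arules)

# Toy transactions standing in for the real basket data.
trn <- as(list(c("Restaurant", "Train Station"),
               c("Restaurant", "Train Station", "Shopping Mall"),
               c("Restaurant", "Shopping Mall")), "transactions")

# Eclat only returns frequent itemsets...
itemsets <- eclat(trn, parameter = list(supp = 0.5, minlen = 2))

# ...so a second pass is needed to turn them into rules with
# confidence and lift, which Apriori produces directly.
rules <- ruleInduction(itemsets, trn, confidence = 0.5)
inspect(rules)
```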

inspect(association.rules[1:10])
##      lhs                           rhs                support     
## [1]  {Food Truck,Restaurant}    => {Train Station}    0.0001033407
## [2]  {Food Truck,Train Station} => {Restaurant}       0.0001033407
## [3]  {Island,Spiritual Center}  => {Train Station}    0.0001044060
## [4]  {Island,Train Station}     => {Spiritual Center} 0.0001044060
## [5]  {Public Art,Restaurant}    => {Train Station}    0.0001171905
## [6]  {Public Art,Train Station} => {Restaurant}       0.0001171905
## [7]  {Casino,Restaurant}        => {Train Station}    0.0001054714
## [8]  {Casino,Train Station}     => {Restaurant}       0.0001054714
## [9]  {Castle,Restaurant}        => {Train Station}    0.0001267788
## [10] {Castle,Train Station}     => {Restaurant}       0.0001267788
##      confidence lift      count
## [1]  0.5914634   1.558997  97  
## [2]  0.4754902   1.953258  97  
## [3]  0.6447368   1.699417  98  
## [4]  0.5025641  27.493197  98  
## [5]  0.6321839   1.666330 110  
## [6]  0.3133903   1.287371 110  
## [7]  0.4400000   1.159765  99  
## [8]  0.3639706   1.495149  99  
## [9]  0.4837398   1.275056 119  
## [10] 0.4250000   1.745850 119


Creating value from generated rules: We based our analysis on the lift factor. Lift is defined as the factor by which the co-occurrence of A and B exceeds the expected probability of A and B co-occurring had they been independent. So the higher the lift, the higher the chance of A and B occurring together. The picture and definition are taken from Selva Prabhakaran’s blog.
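As a quick numeric check of this definition, support, confidence and lift can be computed directly from raw counts; the numbers below are toy values, not taken from the dataset:

```r
# Toy counts: 1000 transactions; A appears in 100, B in 200,
# and A and B appear together in 40.
n <- 1000; n_A <- 100; n_B <- 200; n_AB <- 40

supp_AB <- n_AB / n                            # support of {A,B}: 0.04
conf_AB <- n_AB / n_A                          # confidence of A => B: 0.4
lift_AB <- supp_AB / ((n_A / n) * (n_B / n))   # 0.04 / (0.1 * 0.2) = 2

# lift = 2 > 1: A and B co-occur twice as often as independence would predict.
```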

Recommendations for businesses: Association rules can serve as a good base for business recommendations. For example, let’s take a new burger place that wants to know the best potential locations to open. Based on our Apriori rules, we can recommend places using the lift values of the generated rules.
For example: {Bakery,Donut Shop} => {Burger Joint}

Burger_rules <- apriori(Freq_pattern_trn, parameter =  list(supp=0.00001,conf=0, maxlen=3),appearance = list(default="lhs",rhs = "Burger Joint"))

As mentioned in DataCamp’s tutorial for Apriori, using the library ‘arulesViz’ can give a more interesting and interactive graph for displaying the Apriori rules for our example:

Graph-Based Visualizations: Graph-based techniques visualize association rules using vertices and edges where vertices are labeled with item names, and item sets or rules are represented as a second set of vertices. Items are connected with item-sets rules using directed arrows. Arrows pointing from items to rule vertices indicate LHS items and an arrow from a rule to an item indicates the RHS. The size and color of vertices often represent interest measures.

# create a graphical chart for top 10 rules and items
Top_Burger_rules <- head(Burger_rules, n= 10 , by = "lift")


Recommendations for users: Likewise, we can use the Apriori rules to recommend places for users: by defining user segments (e.g. working people or university students) we can recommend venues based on analyzing the lifestyle of those segments. This can be achieved by filtering the rules that have those segments on the LHS.
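Such filtering can be sketched with arules’ subset() and its %in% operator on the rule sides (the segment item “University” is our own illustrative choice, not a segment defined in the data):

```r
# Keep only rules whose LHS contains "University", as a stand-in for
# the university-student segment (illustrative assumption), then show
# the strongest of them by lift.
student_rules <- subset(association.rules, subset = lhs %in% "University")
inspect(head(student_rules, n = 5, by = "lift"))
```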


Sequence Pattern Analysis

Sequential pattern mining is a topic of data mining concerned with finding statistically relevant patterns between data examples where the values are delivered in a sequence. It is usually presumed that the values are discrete, and thus time series mining is closely related, but usually considered a different activity. Sequential pattern mining is a special case of structured data mining.

Definition from Wikipedia’s Sequential_pattern_mining.
We based our sequential pattern analysis on the idea of frequent 2-itemsets; for this we had to follow a set of steps.

- Group the check-ins data by user
- Order the check-ins of each user by date and time
- For each check-in action of the user, get the next check-in place, longitude, latitude and time of check-in
- Calculate the inter-checkin time (IC_Time) in minutes
- Calculate the inter-checkin distance (IC_Dist) in km using the function distGeo
Note: For the next check-in place we look up the next place checked-in by the user from the original data set (nxt_venue_cat_name), the level-one category (nxt_venue_cat_lvl_one) and the main category (nxt_venue_cat)

# add columns calculating the difference between consecutive checkins in terms of time (min) and distance (km)
chk_poi_country <- chk_poi_country %>%
  group_by(User_ID) %>% 
  arrange(Chk_time) %>% 
  mutate(nxt_chk_time = lead(Chk_time),
         IC_Time = as.numeric(difftime(nxt_chk_time, Chk_time, units = "mins")),
         nxt_chk_lon = lead(Longitude),
         nxt_chk_lat = lead(Latitude),
         IC_Dist = distGeo(matrix(c(Longitude, Latitude), ncol = 2),matrix(c(nxt_chk_lon, nxt_chk_lat), ncol = 2))/1000,
         nxt_venue_cat_name = lead(Venue_Category_Name),
         nxt_venue_cat = lead(venue_category),
         nxt_venue_cat_lvl_one = lead(venue_category_lvl_one)
        )

chk_poi_country <- ungroup(chk_poi_country)

A sample of the data after adding the next venue columns

chk_poi_country %>% head() %>% select(User_ID,Venue_Category_Name,nxt_venue_cat_name,venue_category_lvl_one,nxt_venue_cat_lvl_one,venue_category,nxt_venue_cat)
## # A tibble: 6 x 7
##   User_ID Venue_Category_~ nxt_venue_cat_n~ venue_category_~
##     <int> <chr>            <chr>            <fct>           
## 1  141635 Bar              Noodle House     Bar             
## 2    2485 Skate Park       Home (private)   Athletics & Spo~
## 3     295 Cosmetics Shop   Furniture / Hom~ Cosmetics Shop  
## 4   33134 Zoo              Convenience Sto~ Zoo             
## 5   52935 Noodle House     Grocery Store    Restaurant      
## 6   11899 Music Venue      Japanese Restau~ Music Venue     
## # ... with 3 more variables: nxt_venue_cat_lvl_one <fct>,
## #   venue_category <fct>, nxt_venue_cat <fct>

In our model we use the library (networkD3) for visualizing our sequence rules, specifically a flow-chart graph type called a “Sankey diagram”.
To prepare the rules to be plotted, we group on the check-in venue and its next check-in venue, and then count. Then, based on a combination of selected support values, we select the rules satisfying those thresholds and plot them in the Sankey diagram.
For the data below, we combine 20 rules that satisfy a support of 0.0002, 15 rules that satisfy a support of 0.0004 and 10 rules that satisfy a support of 0.0007, then plot them together.
Since each link between two venue categories represents a frequent 2-itemset rule, the graph shows possible combinations of different rules together. What we find interesting about this method is that it may produce sequences that don’t appear in the data-set; by choosing appropriate support values, we can compare them to the frequent itemsets from Apriori and use such non-existing rules as a base for user recommendations. Feedback can then be collected from the users by rating those suggestions; for clarification we give an example after plotting the graph.

# create a Sankey diagram to show frequent movement from one sub-category to the other
# sankey for level one categories
total <- chk_poi_country %>%
  filter(!is.na(nxt_venue_cat_lvl_one),
         !( venue_category_lvl_one == nxt_venue_cat_lvl_one)) %>%
  group_by(venue_category_lvl_one,nxt_venue_cat_lvl_one) %>%
  summarize(Count = n()) %>%
  ungroup()

supp <- 0.0002
supp_count <- sum(total$Count) * supp

supp_1 <- 0.0004
supp_count_1 <- sum(total$Count) * supp_1

supp_2 <- 0.0007
supp_count_2 <- sum(total$Count) * supp_2

Sequence_chk_agregated_level_one <- total %>%
  filter(Count >= supp_count) %>%
  top_n(20, -Count)

Sequence_chk_agregated_level_one_1 <- total %>%
  filter(Count >= supp_count_1) %>%
  top_n(15, -Count)

Sequence_chk_agregated_level_one_2 <- total %>%
  filter(Count >= supp_count_2) %>%
  top_n(10, -Count)

Sequence_chk_agregated_level_one <- rbind(Sequence_chk_agregated_level_one,
                                          Sequence_chk_agregated_level_one_1,
                                          Sequence_chk_agregated_level_one_2)

Sequence_chk_agregated_level_one$venue_category_lvl_one <- as.character(Sequence_chk_agregated_level_one$venue_category_lvl_one)
Sequence_chk_agregated_level_one$nxt_venue_cat_lvl_one <- as.character(Sequence_chk_agregated_level_one$nxt_venue_cat_lvl_one)

For the Sankey diagram to work, it needs to be provided with nodes and links.
Nodes is the list of venue categories, with an ID assigned to each venue category.
Links is given a source node, a target node and a value parameter, which in our case is the count.

# create a list of unique activities and assign IDs to them (we combine both venue
# and next venue: a venue that only ever appears as the last visit would otherwise be excluded)
activities_id_level_one <- seq_along(unique(c(Sequence_chk_agregated_level_one$venue_category_lvl_one, Sequence_chk_agregated_level_one$nxt_venue_cat_lvl_one))) - 1
# set the names value of IDs to be the name of the venue
names(activities_id_level_one) <- unique(c(Sequence_chk_agregated_level_one$venue_category_lvl_one, Sequence_chk_agregated_level_one$nxt_venue_cat_lvl_one))

# add source and target ID columns mapping each rule's venues to node IDs
Sequence_chk_agregated_level_one$source <- activities_id_level_one[Sequence_chk_agregated_level_one$venue_category_lvl_one]
Sequence_chk_agregated_level_one$target <- activities_id_level_one[Sequence_chk_agregated_level_one$nxt_venue_cat_lvl_one]

nodes_level_one <- data.frame(node=activities_id_level_one, name=names(activities_id_level_one))
links_level_one <- data.frame(source=Sequence_chk_agregated_level_one$source, target=Sequence_chk_agregated_level_one$target, value=Sequence_chk_agregated_level_one$Count)

Now that the data is prepared, let’s plot the sankey diagram.

networkD3::sankeyNetwork(Links = links_level_one, Nodes = nodes_level_one, 
                         Source = 'source', 
                         Target = 'target', 
                         Value = 'value', 
                         NodeID = 'name',
                         units = 'count', 
                         nodePadding = 5,
                         width = 1500,
                         height = 1000,
                         fontSize = 20)

Considering the graph above, take the path {Bookstore => Coffee Shop => Park => Metro Station => Museum}. This path may not appear in our data-set as a frequent 5-itemset, but we accepted support values of 0.0007, 0.0004 and 0.0002 for our data-set, and we know that {Bookstore => Coffee Shop} has a support of 0.0004, {Coffee Shop => Park} has a support of 0.0002, {Park => Metro Station} has a support of 0.0007 and {Metro Station => Museum} has a support of 0.0002. Based on the assumption of accepting those support values, we can use this analysis as an input to recommender systems and businesses.

Spatial Clustering

Division of the region based on activity density, where an activity is defined by its venue category, leads us to spatial data clustering:
“activity zone” is the term we assign to each cluster.
Each data point can be represented as a vector of two values, x and y, here the longitude and latitude of a venue; in other words, the spatial clustering problem is exactly like clustering 2-d vectors. This clustering can be done via density-based or distance-based methods.
Distance-based methods have two weaknesses that make them less suitable for spatial data clustering: first, they need the number of clusters as an input, and second, they allocate all objects to clusters and never identify noise. Although we are aware of these weak points, we first try DBSCAN as a density-based method and later k-means as a distance-based candidate. For simplicity we choose the data of one candidate prefecture from the Japan check-ins, here Hiroshima.

DBSCAN clustering:
DBSCAN detects clusters as dense areas in the data space. These clusters are differentiated from regions with low density (noise). Clusters may have an arbitrary shape, and the points inside a cluster may be arbitrarily distributed.
DBSCAN requires two parameters: eps (the size of the epsilon neighborhood) and the minimum number of points required to form a cluster (minPts). A point that does not have minPts neighbors within its eps neighborhood, and is also not located in the eps neighborhood of a sufficiently dense point, is labeled as noise (a more detailed description of the algorithm is beyond the scope of this work).
minPts is often set to the dimensionality of the data plus one, or higher, as per rdocumentation.org. In our case we tried both 3 and 5, which give almost the same result. To find a suitable value for eps, the knee in kNNdistplot can be used.

dbs %>% head()
## # A tibble: 6 x 5
##   Venue_Category_Name Venue_ID                 Latitude Longitude cluster
##   <chr>               <chr>                       <dbl>     <dbl>   <int>
## 1 Art Museum          4b2204abf964a5201d4324e3     34.4      132.       1
## 2 Castle              4b2204d6f964a5201f4324e3     34.4      132.       1
## 3 Train Station       4b269592f964a520e57d24e3     34.4      132.       1
## 4 Library             4b55168cf964a520b9da27e3     34.3      132.       1
## 5 Train Station       4b563509f964a5204f0528e3     34.2      133.       2
## 6 Train Station       4b563648f964a520900528e3     34.3      133.       2
kNNdistplot(dbs[,3:4], k = 5 ) 
#where the k is number of nearest neighbors used (use minPoints)
abline(h = 0.016, col="red",lty = 2)

#We can see that knee is around a distance of 0.016.

Now we can do the clustering:

clusters <- dbscan(select(dbs, 'Latitude', 'Longitude'), eps = 0.016, minPts = 5)
dbs$cluster <- clusters$cluster
groups  <- dbs %>% filter(cluster != 0)
noise  <- dbs %>% filter(cluster == 0)

The plots look as follows; the black points are the noise:

Let’s have a look at the results plotted on the map:


From the visualization we conclude that the DBSCAN output is as desired for every cluster except cluster 1 (the green-blue one). Cluster 1 is very large and could be broken down into smaller areas.

K-means clustering:
Let’s try it with k-means and see the difference.
We take advantage of the DBSCAN results and exclude the noise points from the dataset, so we use the groups dataset for k-means instead of dbs.

km <- kmeans(groups_km[,3:4],centers = 3, nstart = 5)
groups_km$cluster <-km$cluster

Taking the silhouette metric to choose an appropriate cluster count shows that k=3 yields the maximum average silhouette. Also, repeating the clustering with k=5 and k=9, we realized that the additional partitioning happens mostly in the biggest cluster (the green area in the middle of the map), which is largely the same area as cluster 1 in DBSCAN. Consequently we decided to combine both clustering methods: take advantage of DBSCAN for the smaller dense areas, and apply k-means to the big cluster produced by DBSCAN.
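The combined approach described above can be sketched as follows on synthetic coordinates (a hedged illustration, not the project's original code; the blob locations and parameters are made up): run DBSCAN first, then split only the largest resulting cluster further with k-means.

```r
library(dbscan)

# Synthetic venue coordinates: one large diffuse blob and one small dense blob
set.seed(1)
pts <- data.frame(Latitude  = c(rnorm(200, 34.4, 0.02), rnorm(30, 34.2, 0.01)),
                  Longitude = c(rnorm(200, 132.5, 0.02), rnorm(30, 133.0, 0.01)))

# Step 1: DBSCAN, as in the Hiroshima analysis (cluster 0 marks noise)
db <- dbscan(pts, eps = 0.016, minPts = 5)
pts$cluster <- db$cluster

# Step 2: take the biggest DBSCAN cluster and partition it with k-means
big <- as.integer(names(which.max(table(pts$cluster[pts$cluster != 0]))))
idx <- pts$cluster == big
km  <- kmeans(pts[idx, c("Latitude", "Longitude")], centers = 3, nstart = 5)

# Re-label the sub-clusters with fresh ids appended after the existing ones
pts$cluster[idx] <- max(pts$cluster) + km$cluster
```

The noise points keep label 0, so they can still be excluded from downstream steps.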


Temporal Clustering

We are interested in finding similar patterns of check-in frequency over the time dimension.
This leads us to temporal clustering. The check-ins for each venue can be represented as a time series over the time dimension.
Time series clustering: let’s assume that for each activity (venue category) there is a time series of check-ins over the defined time dimension. This time dimension can be based on different granularity levels:
- Time intervals in a day
- Type of day (workday vs. weekend)
- Seasons
- Other informative levels

Here we focus on the first level, considering the data as hourly check-ins.

Data preparation and manipulations for the next steps:
- Since we are not interested in all categories, we choose the top K=30 popular venue categories for a specific country dataset.
- Group the check-ins by those popular venue categories and by the hour of the check-in.
- Make a pivot table and, from it, a matrix.
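The steps above can be sketched with dplyr/tidyr along these lines (a sketch under assumptions: the joined tibble is `chk_poi_country` as shown earlier, and it carries the local-time column `Chk_time` from the Checkins table):

```r
library(dplyr)
library(tidyr)

# Top K = 30 most popular venue categories in the country dataset
top_cats <- chk_poi_country %>%
  count(Venue_Category_Name, sort = TRUE) %>%
  slice_head(n = 30) %>%
  pull(Venue_Category_Name)

# Count check-ins per category and hour, then pivot into a category x hour matrix
time_cat_matrix <- chk_poi_country %>%
  filter(Venue_Category_Name %in% top_cats) %>%
  mutate(hour = lubridate::hour(Chk_time)) %>%
  count(Venue_Category_Name, hour) %>%
  pivot_wider(names_from = hour, values_from = n, values_fill = 0) %>%
  tibble::column_to_rownames("Venue_Category_Name") %>%
  as.matrix()
```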

time_cat_matrix %>% head()
##               0   1   2   3   4   5   6    7    8    9   10  11   12   13
## Arcade      203  71  26  30   9   7  74   47  107  362  683 722  995 1247
## Bar         777 414 267 151  92  60  26   36   32   54   44  96  327  273
## Bookstore    70  22  18   6   1  13  38   90  105  150  566 834 1364 1551
## Bridge      556 271 126  98 118 297 635 1610 2219 1336  943 915 1059 1177
## Building    144  40  41  26  40  94 166  632 1735 1458  675 684 1180 1240
## Bus Station 230  66  22  22  41 331 745 2159 2400 1341 1109 957 1183 1159
##               14   15   16   17   18   19   20   21   22   23
## Arcade      1684 2286 2745 3382 3719 3245 2612 1933 1310  659
## Bar          185  213  285  693 1500 2435 2523 2231 1960 1359
## Bookstore   1812 2254 2906 3591 4612 3951 2696 1409  603  228
## Bridge      1104 1355 1677 2186 2590 2175 1654 1295 1033  906
## Building    1094 1239 1299 1623 1880 1526  989  598  435  321
## Bus Station 1173 1526 2083 2668 3317 2972 2333 1976 1539  872

To partition time series data into groups based on similarity or distance, so that time series in the same cluster are similar, we have to make decisions in following areas:
- Scale type
- Distance type
- Clustering method

Scaling: to make the variables comparable, since they are not on the same scale, we have to scale them. Two common options are normalization and standardization; here we chose standardization. The values in each row are standardized by subtracting their mean and dividing by their standard deviation, so each row is scaled to mean = 0, sd = 1.

time_cat_matrix_scaled <- t(scale(t(time_cat_matrix))) 
## Centers and scales matrix row-wise
heatmap(time_cat_matrix_scaled,Colv=NA, scale='none',col=palette,xlab = "Hours", ylab = "Category")


This heatmap shows how the selected activities are correlated over the time dimension and check-in density. For example, venues like Bar and Sake Bar are late-night activities. People start checking in at stores from 10 AM, and the stores become most crowded between 5-7 PM, mostly the end of the working day. Transportation-related venues have two peak visiting periods, one in the morning (7-9 AM) and the other in the evening around 6 PM, which is evidence of people moving between their working and living areas.
Distance measure: to use this scaled matrix later in hierarchical clustering, we have to derive a distance matrix from it. Here we calculate two distances:
1. Correlation-based distance

time_cat_cor <- cor(t(time_cat_matrix_scaled), method="pearson")
time_cat_cor_dist <- as.dist(1-time_cat_cor)

2. Euclidean distance

time_cat_eucl <- dist(time_cat_matrix_scaled, method = "euclidean")
fviz_dist(time_cat_eucl, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))


This visualization of the distance matrix indicates, in green, the venues that are most similar to each other (smallest distance). Besides the four big green squares along the diagonal, there are a few other venues that are weakly similar to each other. We can also see that universities are not similar to bars, which the dark orange color represents.

Clustering method:
Candidates:
Partitioning algorithms: k-means, PAM (Partitioning Around Medoids)
Hierarchical algorithms: agglomerative clustering (hclust)
We will mostly focus on k-means and hierarchical agglomerative clustering.
We already know the drawbacks of k-means: we need a sense of the number of clusters in advance, it makes assumptions about the distribution of the data, and the results can be highly affected by the initial cluster centroids. On the other hand, the result of hclust is plotted as a dendrogram, which we can use to decide at which height to cut the tree to form our clusters. Even considering these points, we will examine both methods on our data. To compare the methods we will perform a validation test using the clValid package, as per sthda.com.
clValid offers 3 different types of clustering validation measures:
- Internal validation
- Stability validation
- Biological validation
Here we mostly focus on internal validation measures, which contain the following items (for details about the others you can refer to the mentioned web site):
Connectivity: the degree of connectedness of the clusters; it corresponds to the extent to which items are placed in the same cluster as their nearest neighbors in the data space. Its range is between 0 and infinity, and lower values are preferred.
The other two combine measures of compactness and separation of the clusters:
Average silhouette width: as per datanovia, silhouette analysis measures how well an observation is clustered and estimates the average distance between clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters. The silhouette width ranges over [-1, 1], from poorly clustered to well clustered.
Dunn index: as per Wikipedia, the aim is to identify sets of clusters that are compact, with a small variance between members of the cluster, and well separated, where the means of different clusters are sufficiently far apart compared to the within-cluster variance. It has a value between 0 and infinity, and bigger values are preferred.
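The `valid_test` object summarized below can be produced with a call along these lines (a hedged reconstruction, since the original chunk is not shown; the method list and cluster range follow the printed output, and the output includes both stability measures, APN/AD/ADM/FOM, and internal ones, Connectivity/Dunn/Silhouette):

```r
library(clValid)

# Validate hierarchical, k-means, and PAM clusterings for k = 2..10
# on the scaled category-by-hour matrix (defaults assumed for metric/linkage)
valid_test <- clValid(time_cat_matrix_scaled,
                      nClust = 2:10,
                      clMethods = c("hierarchical", "kmeans", "pam"),
                      validation = c("internal", "stability"))
```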
Summary of the validity test:

summary(valid_test)
## 
## Clustering Methods:
##  hierarchical kmeans pam 
## 
## Cluster sizes:
##  2 3 4 5 6 7 8 9 10 
## 
## Validation Measures:
##                                  2       3       4       5       6       7       8       9      10
##                                                                                                   
## hierarchical APN            0.0361  0.0149  0.0429  0.0364  0.0256  0.0219  0.0361  0.0176  0.0259
##              AD             2.9561  2.2351  2.0174  1.5324  1.2351  1.0918  1.0070  0.8884  0.8324
##              ADM            0.1896  0.0842  0.1756  0.1755  0.1670  0.0716  0.1201  0.0581  0.0749
##              FOM            0.4559  0.3572  0.3460  0.2838  0.2212  0.2102  0.2019  0.1880  0.1848
##              Connectivity   5.4913 10.7361 14.3274 19.7845 25.1242 26.4575 29.5190 32.4190 34.4190
##              Dunn           0.2440  0.3668  0.3726  0.4224  0.4343  0.4366  0.4366  0.5274  0.5274
##              Silhouette     0.3290  0.3574  0.3695  0.4373  0.4768  0.4493  0.4143  0.4028  0.3698
## kmeans       APN            0.0426  0.0000  0.0404  0.0081  0.0203  0.0094  0.0329  0.0111  0.0246
##              AD             2.6402  2.1166  1.9198  1.3459  1.1906  1.0467  0.9671  0.8506  0.8034
##              ADM            0.2437  0.0000  0.1553  0.0342  0.0822  0.0280  0.1328  0.0473  0.0785
##              FOM            0.4338  0.3405  0.3340  0.2519  0.2156  0.2071  0.2014  0.1883  0.1885
##              Connectivity   9.2433 14.7516 18.3429 20.8512 25.1242 27.8159 35.4901 38.0651 40.0651
##              Dunn           0.2835  0.3783  0.3844  0.4404  0.4343  0.4784  0.4121  0.4573  0.4573
##              Silhouette     0.2929  0.3373  0.3633  0.4905  0.4768  0.4700  0.4427  0.4389  0.4059
## pam          APN            0.1439  0.0166  0.0024  0.0045  0.0148  0.0080  0.0157  0.0424  0.0275
##              AD             2.7528  2.0344  1.5799  1.3450  1.2157  1.0560  0.9443  0.9118  0.8084
##              ADM            0.6560  0.0739  0.0116  0.0290  0.0511  0.0220  0.0536  0.1316  0.0780
##              FOM            0.4638  0.3535  0.2842  0.2546  0.2411  0.1982  0.1944  0.1952  0.1807
##              Connectivity   7.8135 16.2194 17.2599 20.8512 23.2679 27.9159 35.5901 36.6984 37.6762
##              Dunn           0.2827  0.2462  0.3224  0.4404  0.5622  0.4724  0.3941  0.4012  0.4012
##              Silhouette     0.2930  0.3633  0.4475  0.4905  0.4735  0.4541  0.4283  0.3963  0.3726
## 
## Optimal Scores:
## 
##              Score  Method       Clusters
## APN          0.0000 kmeans       3       
## AD           0.8034 kmeans       10      
## ADM          0.0000 kmeans       3       
## FOM          0.1807 pam          10      
## Connectivity 5.4913 hierarchical 2       
## Dunn         0.5622 pam          6       
## Silhouette   0.4905 kmeans       5

Since the internal validation metrics do not help us choose between the methods in this case, we decided to run experiments with both k-means and hierarchical clustering.
K-means clustering
Determining the optimal number of clusters
Although the validation test also suggests a number of clusters for each measure, we can run further tests using the following popular methods, as per datanovia:
1- Elbow method: apply different values of K (number of clusters) and calculate the total within-cluster sum of squares (WSS). The total WSS measures the compactness of the clustering, and we want it to be as small as possible. Consequently this value should be minimized, and we stop at the point where adding a new cluster no longer helps decrease it.

fviz_nbclust(time_cat_matrix_scaled, kmeans, method = "wss") +
  geom_vline(xintercept = 4, linetype = 2)+
  labs(subtitle = "Elbow method")


The elbow method is sometimes ambiguous. An alternative is the average silhouette method.
2- Average silhouette method: computes the average silhouette of observations for different values of k. Within the desired range of K, the best value is the one that maximizes the average silhouette.

fviz_nbclust(time_cat_matrix_scaled, kmeans, method = "silhouette")+
  labs(subtitle = "Silhouette method")


3- Gap statistic method: we will not use this method here, since its result for the best K was quite different from the other two methods, and after visualizing the clusters based on it we realized it was not our desired clustering result (for more details you can refer to the mentioned website).

Based on the results we choose a candidate for K, in this case k=5. We focus on the silhouette, but any of the other validation measures could be chosen as well.

km.5 <- eclust(time_cat_matrix_scaled, "kmeans", k = 5, nstart = 25, graph = FALSE)
fviz_silhouette(km.5, palette = "jco", ggtheme = theme_classic())
##   cluster size ave.sil.width
## 1       1    5          0.24
## 2       2    8          0.55
## 3       3    9          0.57
## 4       4    6          0.52
## 5       5    2          0.47

If negative values appear in the clusters’ silhouette plot, this is an indicator of misclassification.
Applying k-means with 5 clusters:

time_cat_kmeans<- kmeans(time_cat_matrix_scaled, 5, nstart = 25)

which forms the following clusters:
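The membership lists shown below can be extracted by splitting the category names (the matrix row names) by their k-means cluster assignment, for example:

```r
# Group venue categories by the cluster label k-means assigned to them
split(rownames(time_cat_matrix_scaled), time_cat_kmeans$cluster)
```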

## [[1]]
## [1] "Arcade"             "Bookstore"          "Cafeteria"         
## [4] "Department Store"   "Electronics Store"  "Grocery Store"     
## [7] "Hobby Shop"         "Mall"               "Miscellaneous Shop"
## 
## [[2]]
## [1] "Bar"      "Sake Bar"
## 
## [[3]]
## [1] "Bridge"            "Bus Station"       "Convenience Store"
## [4] "Hotel"             "Platform"          "Road"             
## [7] "Subway"            "Train Station"    
## 
## [[4]]
## [1] "Building"    "Coffee Shop" "Office"      "Park"        "University" 
## 
## [[5]]
## [1] "Chinese Restaurant"    "Fast Food Restaurant"  "Italian Restaurant"   
## [4] "Japanese Restaurant"   "Ramen /  Noodle House" "Restaurant"

We can also plot the clusters in the coordinates of the first two principal components:

fviz_cluster(time_cat_kmeans, data=time_cat_matrix_scaled,geom = c("point","text"),main = 'k-means clusters')


And here is the visualization of time series in each cluster:
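The per-cluster time-series plot can be reproduced along these lines (a sketch, since the original plotting chunk is not shown; it reuses `time_cat_matrix_scaled` and `time_cat_kmeans` from above):

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Reshape the scaled matrix to long form, attach cluster labels, and
# draw one line per category, faceted by k-means cluster
as.data.frame(time_cat_matrix_scaled) %>%
  tibble::rownames_to_column("category") %>%
  mutate(cluster = factor(time_cat_kmeans$cluster)) %>%
  pivot_longer(-c(category, cluster),
               names_to = "hour", values_to = "checkins") %>%
  mutate(hour = as.integer(hour)) %>%
  ggplot(aes(hour, checkins, group = category)) +
  geom_line(alpha = 0.6) +
  facet_wrap(~ cluster) +
  labs(x = "Hour of day", y = "Scaled check-ins")
```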

Hierarchical clustering
In agglomerative hierarchical clustering there are different aggregation (linkage) methods; in this project we consider the following:

Single: the distance between two clusters C1 and C2 is the minimum distance between two points x and y, belonging to C1 and C2 respectively.
Drawback: even though many of the elements in each cluster may be very distant from each other, a single close pair of elements causes the clusters to be merged, and consequently the chaining phenomenon happens.

Complete: the distance between two clusters C1 and C2 is the maximum distance between two points x and y, with x in C1 and y in C2.

Average: the distance between two clusters C1 and C2 is the mean of the distances between pairs of points x and y, where x is in C1 and y is in C2.

Ward: according to rdocumentation, the Ward method minimizes the total within-cluster variance. At each step the pair of clusters with the minimum between-cluster distance is merged; to implement this, at each step find the pair of clusters whose merge leads to the minimum increase in total within-cluster variance.
Cophenetic distances for a hierarchical clustering
We choose one of the mentioned methods based on the cophenetic distance.
Based on stat.ethz.ch, the cophenetic distance between two observations that have been clustered is defined as the intergroup dissimilarity at which the two observations are first combined into a single cluster. A high correlation between the original distance matrix and the cophenetic distance matrix shows that the dendrogram is an appropriate summary of the data.
Our experiments show that average linkage provides the highest value (0.7784401), so we chose this method. The dendrogram is as follows.
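The linkage comparison described above can be sketched as follows (a sketch; it assumes the hierarchy was built on the correlation-based distance `time_cat_cor_dist` computed earlier, which the original chunk does not confirm):

```r
# For each linkage, build the tree and correlate the original distances
# with the cophenetic distances induced by the dendrogram
linkages <- c("single", "complete", "average", "ward.D2")
sapply(linkages, function(m) {
  hc <- hclust(time_cat_cor_dist, method = m)
  cor(time_cat_cor_dist, cophenetic(hc))
})
```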

#coloring represents the clusters resulted from tree cut off k=5
fviz_dend(time_cat_hclust_average, cex = 0.5, k = k_hclust, color_labels_by_k = TRUE, hang = -1)


Challenges

One of the sections that took an unexpected amount of time was data preparation. At each step, we realized some changes were needed in the data or something was missing. That hindered us from reaching the level of detailed analysis we had promised. The amount of data was another barrier, which led us to base our analysis on one country. Although we could have run the same code on other countries' datasets, comparing them would have required more time.

***

Conclusion and future work

In conclusion, our analysis was able to identify trivial data patterns as well as other interesting potential patterns in each country. The trivial patterns reassured us that the analysis was working correctly. For example, we found stereotypical lifestyles: people in Japan have most of their check-ins in train stations, restaurants, and metro stations, while people in the United States check in more at restaurants, bars, and offices; the top check-ins in Turkey included cafeterias, restaurants, and shopping malls.
Besides the stereotypical lifestyles, we also found some interesting predictions in the apriori results with high lift values. Using temporal and spatial clustering, we managed to discover which venue activities are combined based on time and location independently.

In the end we had the idea of combining our temporal and spatial clustering techniques with the frequent and sequential pattern mining results, but unfortunately we did not have enough time, so we mention it as a future work recommendation. The main idea was to combine the techniques as layers or filters on top of the data. Taking, for example, the data of “Hiroshima”, we can apply spatial clustering as a filtering layer to define activity zones, and then add the result of temporal clustering as a second filtering layer to define activity times for each of the zones defined by spatial clustering.
This will generate more meaningful patterns that can be combined with frequent itemset results from the apriori algorithm, to be used as input for recommender system engines.


References


[^1]: Zheng, Yu, and Jason Hong. “The Preface of the 4th International Workshop on Location-Based Social Networks.” Proceedings of the 2012 ACM Conference on Ubiquitous Computing. ACM, 2012.